Description
This R notebook accompanies the MIC workshop: Modern data analysis in R/RStudio. It contains placeholders (...) to fill the code in exercises presented during the workshop.
You can knit this document into an HTML file. Once you fill in the code, change the parameter eval=F to eval=T in every code snippet. This way, the output of every code section will appear in the notebook.
R primer
The R for data Science book and the accompanying website is an excellent resource to get acquainted with R and its applications to data science.
Here is an introduction to data.table package that we use extensively in this analysis. Some advanced tips and tricks with data.table are covered here.
Data description
Data come from a microfluidic experiment where PC-12 cells were stimulated with 3 different growth factors (EGF, NGF, FGF2) at different concentrations. It comes from our recent publication Temporal perturbation of ERK dynamics reveals network architecture of FGF2-MAPK signaling (2019) available here.
The microfluidic device has four separate channels, thus 4 different treatments can be performed in a single experiment.
The PC-12 cells express a FRET biosensor to indicate activity of the ERK kinase.
FRET biosensor
ERK is an end point of the MAPK signalling pathway.
MAPK pathway
Three different growth factors tested activate different parts of the MAPK network and result in different signalling dynamics, i.e. the behaviour of ERK activity over time.
Cells stimulated with growth factors were then imaged with a single-cell resolution over 3-4 hours at a 2-minute time resolution. Each channel of a microfluidic device was imaged at 4 different fields of view. Thus, we have 16 fields of view per experiment.
The folder data contains 3 sets of files that correspond to 3 growth factor treatments at 4 different concentrations. A single set from a single experiment includes:
tCoursesSelected_xGFsust_expXXXX.csv.gz a compressed file with single-cell data. Each row corresponds to the measurement of ERK activity in a single cell, at a single time point. Columns:
Metadata_Seriesan integer, 0-15, that corresponds to the field of view.Metadata_Tan integer, 0-100, corresponds to the time frame number in a movie from a single field of view.TrackObjects_Labelan integer that describes the track number from object tracking algorithm. The number is unique only within a single field of view.Intensity_MeanIntensity_Ratioa float with the measurement of FRET ratio; this is our proxy of ERK activity.
experimentDescription_EGFsust.xlsx an Excel file with experiment description. Columns:
Positionan integer, 0-15, with the number of the field of view.Channelan integer, 1-4, with the number of the microfluidic channel.Stim_Conca string with an indication of the GF concentration.
Data files
Exercise 1: read data
Read single-cell data
Read data from a single experiment from this file ../data/original/tCoursesSelected_EGFsust_exp20170316.csv.gz. Use data.table::fread function.
# load required R packages
require(R.utils)
require(data.table)
# read the file
dtEGF = fread("../data/original/tCoursesSelected_EGFsust_exp20170316.csv.gz")
# view first couple of lines of this dataset
head(dtEGF) Metadata_Series Metadata_T TrackObjects_Label Intensity_MeanIntensity_Ratio
1: 0 0 3 1229.246
2: 0 0 4 1279.513
3: 0 0 8 1236.920
4: 0 0 12 1170.460
5: 0 0 13 1150.294
6: 0 0 15 1128.049
Read experimental description
The experimental description is located in this Excel file ../data/original/experimentDescription_EGFsust.xlsx.
Use the readxl::read_xlsx function.
require(readxl)
# the as.data.table function converts the tibble returned by read_xlsx function
# to a `data.table`.
dtEGFexp = as.data.table(readxl::read_xlsx("../data/original/experimentDescription_EGFsust.xlsx"))
head(dtEGFexp) Position Channel Stim_Duration Stim_Conc Stim_Treat
1: 0 1 sust 250 ng/ml EGF
2: 1 1 sust 250 ng/ml EGF
3: 2 1 sust 250 ng/ml EGF
4: 3 1 sust 250 ng/ml EGF
5: 4 2 sust 25 ng/ml EGF
6: 5 2 sust 25 ng/ml EGF
Acquisition_frequency_min Equilibration_min Device_type
1: 2 43 STD 4-channel
2: 2 43 STD 4-channel
3: 2 43 STD 4-channel
4: 2 43 STD 4-channel
5: 2 43 STD 4-channel
6: 2 43 STD 4-channel
Merge data with experimental description
Add columns Stim_Treat and Stim_Conc with experiment description to single-cell data using merge function. Identify common columns between the two tables that will be used for merging.
dtEGFwithExp = merge(dtEGF, dtEGFexp, by.x = "Metadata_Series", by.y = "Position")
head(dtEGFwithExp) Metadata_Series Metadata_T TrackObjects_Label Intensity_MeanIntensity_Ratio
1: 0 0 3 1229.246
2: 0 0 4 1279.513
3: 0 0 8 1236.920
4: 0 0 12 1170.460
5: 0 0 13 1150.294
6: 0 0 15 1128.049
Channel Stim_Duration Stim_Conc Stim_Treat Acquisition_frequency_min
1: 1 sust 250 ng/ml EGF 2
2: 1 sust 250 ng/ml EGF 2
3: 1 sust 250 ng/ml EGF 2
4: 1 sust 250 ng/ml EGF 2
5: 1 sust 250 ng/ml EGF 2
6: 1 sust 250 ng/ml EGF 2
Equilibration_min Device_type
1: 43 STD 4-channel
2: 43 STD 4-channel
3: 43 STD 4-channel
4: 43 STD 4-channel
5: 43 STD 4-channel
6: 43 STD 4-channel
Read file list
The table dtEGFwithExp created above contains single-cell data with treatment information (i.e. growth factor and its concentration) for a single growth factor only! Now, let’s read all files in the ../data folder corresponding to three growth factor treatments.
- Create a vector with raw file names in a folder
[1] "../data/original/tCoursesSelected_EGFsust_exp20170316.csv.gz"
[2] "../data/original/tCoursesSelected_FGFsust_exp20170224.csv.gz"
[3] "../data/original/tCoursesSelected_NGFsust_exp20170725.csv.gz"
- Apply an
freadfunction to read individual files
- Name list elements according to file names they were read from.
These names will be converted to an ID column when binding the list with rbindlist.
- Bind the list.
Note that tables read from individual files have different column numbers. Therefore, when binding, we specify to option use.names = T to match column between different list elements, and fill = T to fill non-existing columns with NA`.
In addition, we use the option idcol = "fileName" to create a column that contains names of list elements that we assigned in the previous step.
dtAll = rbindlist(lAll, use.names = T, fill = T, idcol = "fileName")
# Remove unnecessary columns; they exist only in 2 out of 3 of the files
dtAll[, `:=`(c("Location_Center_X", "Location_Center_Y"), NULL)]- Extract growth factor name from the
fileNamecolumn.
gsub is a handy function to extract strings using regular expressions.
dtAll[, `:=`(Stim_Treat, gsub(".*_(.*)sust_.*", "\\1", fileName))]
# remove the fileName column
dtAll[, `:=`(fileName, NULL)]The dtAll table is still missing information about the growth factor concentration. To add that, we need to read ../data/original/experimentDescription_* files.
Milestone 1
Data from all experiments merged into a single file:
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: EGF 0 0 3
2: EGF 0 0 4
3: EGF 0 0 8
4: EGF 0 0 12
5: EGF 0 0 13
6: EGF 0 0 15
Intensity_MeanIntensity_Ratio Channel Stim_Conc
1: 1229.246 1 250 ng/ml
2: 1279.513 1 250 ng/ml
3: 1236.920 1 250 ng/ml
4: 1170.460 1 250 ng/ml
5: 1150.294 1 250 ng/ml
6: 1128.049 1 250 ng/ml
Before proceeding further, execute the cell below. It contains definitions of column names that we will use throughout this workshop.
# create a list with strings for column names
lCol = list(
fov = "Metadata_Series",
frame = "Metadata_T",
trackid = "TrackObjects_Label",
meas = "Intensity_MeanIntensity_Ratio",
measNorm = "Intensity_MeanIntensity_Ratio_Norm",
ch = "Channel",
treatGF = "Stim_Treat",
treatConc = "Stim_Conc",
trackiduni = "TrackObjects_Label_Uni", # we will add this column later
clid = "Cluster_Label" # we will add this column later
)Exercise 2: unique time series ID’s
The track ID column is unique only within a single field of view, which is given in the Metadata_Series column.
Create a column with unique time series ID’s that includes the treatment, field of view, and track ID. It will become quite handy later on during subsetting and clustering. Use paste, paste0, or sprintf functions to concatenate strings from different columns.
dtAll[, `:=`((lCol$trackiduni), sprintf("%s_%02d_%04d", get(lCol$treatGF), get(lCol$fov),
get(lCol$trackid)))]
head(dtAll) Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: EGF 0 0 3
2: EGF 0 0 4
3: EGF 0 0 8
4: EGF 0 0 12
5: EGF 0 0 13
6: EGF 0 0 15
Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1: 1229.246 1 250 ng/ml EGF_00_0003
2: 1279.513 1 250 ng/ml EGF_00_0004
3: 1236.920 1 250 ng/ml EGF_00_0008
4: 1170.460 1 250 ng/ml EGF_00_0012
5: 1150.294 1 250 ng/ml EGF_00_0013
6: 1128.049 1 250 ng/ml EGF_00_0015
Data exploration
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
Length:120086 Min. : 0.000 Min. : 0.00 Min. : 1.00
Class :character 1st Qu.: 3.000 1st Qu.: 25.00 1st Qu.:11.00
Mode :character Median : 7.000 Median : 50.00 Median :22.00
Mean : 7.477 Mean : 49.99 Mean :23.42
3rd Qu.:12.000 3rd Qu.: 75.00 3rd Qu.:34.00
Max. :15.000 Max. :100.00 Max. :80.00
Intensity_MeanIntensity_Ratio Channel Stim_Conc
Min. : 677.9 Min. :1.000 Length:120086
1st Qu.:1196.4 1st Qu.:1.000 Class :character
Median :1246.3 Median :2.000 Mode :character
Mean :1257.3 Mean :2.485
3rd Qu.:1311.9 3rd Qu.:4.000
Max. :4060.9 Max. :4.000
NA's :18
TrackObjects_Label_Uni
Length:120086
Class :character
Mode :character
Missing data
To obtain rows with missing data in the Intensity_MeanIntensity_Ratio column, use is.na function. It returns TRUE, if the argument is NA.
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: FGF 8 11 1
2: FGF 8 11 2
3: FGF 8 11 5
4: FGF 8 11 6
5: FGF 8 11 7
6: FGF 8 11 8
Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1: NA 3 2.5 ng/ml FGF_08_0001
2: NA 3 2.5 ng/ml FGF_08_0002
3: NA 3 2.5 ng/ml FGF_08_0005
4: NA 3 2.5 ng/ml FGF_08_0006
5: NA 3 2.5 ng/ml FGF_08_0007
6: NA 3 2.5 ng/ml FGF_08_0008
Check measurements of a single time series. Note that the two operations on the data.table have been chained, i.e. dtAll[][]. First, we selected all rows of dtAll which have the TrackObjects_Label_Uni column equal to FGF_08_0001, then we select Intensity_MeanIntensity_Ratio column from that result. The output is a vector.
[1] 1154.882 1153.657 1150.764 1147.915 1140.173 1142.111
Check the length of all other time series with NA’s.
First, let’s create a vector with TrackObjects_Label_Uni that contain NA's in the Intensity_MeanIntensity_Ratio column. Again, we’re chaining data.table operations.
# make a vector with strings of unique track ids of time series that contain NA's
vsTracksWithNAs = unique(dtAll[is.na(get(lCol$meas))][[lCol$trackiduni]])
vsTracksWithNAs [1] "FGF_08_0001" "FGF_08_0002" "FGF_08_0005" "FGF_08_0006" "FGF_08_0007"
[6] "FGF_08_0008" "FGF_08_0010" "FGF_08_0011" "FGF_08_0014" "FGF_08_0017"
[11] "FGF_08_0019" "FGF_08_0020" "FGF_08_0024" "FGF_08_0021" "FGF_08_0023"
[16] "FGF_08_0022" "FGF_08_0025" "FGF_08_0027"
The unique function takes only unique elements of its argument. This is to ensure that even if a single track contains multiple NA's we take its TrackObjects_Label_Uni only once.
Second, we subset the dtAll to obtain rows where TrackObjects_Label_Uni belongs to elements of a vsTracksWithNAs vector.
One option is to use %in% operator:
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: FGF 8 0 1
2: FGF 8 0 2
3: FGF 8 0 5
4: FGF 8 0 6
5: FGF 8 0 7
6: FGF 8 0 8
Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1: 1154.882 3 2.5 ng/ml FGF_08_0001
2: 1116.447 3 2.5 ng/ml FGF_08_0002
3: 1250.728 3 2.5 ng/ml FGF_08_0005
4: 1204.617 3 2.5 ng/ml FGF_08_0006
5: 1206.455 3 2.5 ng/ml FGF_08_0007
6: 1159.064 3 2.5 ng/ml FGF_08_0008
Another, faster option is to set the key in dtAll to the TrackObjects_Label_Uni column:
# Key the table according to a unique track ID
setkeyv(dtAll, lCol$trackiduni)
head(dtAll[vsTracksWithNAs]) Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: FGF 8 0 1
2: FGF 8 1 1
3: FGF 8 2 1
4: FGF 8 3 1
5: FGF 8 4 1
6: FGF 8 5 1
Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1: 1154.882 3 2.5 ng/ml FGF_08_0001
2: 1153.657 3 2.5 ng/ml FGF_08_0001
3: 1150.764 3 2.5 ng/ml FGF_08_0001
4: 1147.915 3 2.5 ng/ml FGF_08_0001
5: 1140.173 3 2.5 ng/ml FGF_08_0001
6: 1142.111 3 2.5 ng/ml FGF_08_0001
Finally we can calculate the number of time points in tracks with NA's.
TrackObjects_Label_Uni N
1: FGF_08_0001 101
2: FGF_08_0002 101
3: FGF_08_0005 101
4: FGF_08_0006 101
5: FGF_08_0007 101
6: FGF_08_0008 101
Exercise 3: identify missing data
Calculate the number of time points per time series. Check the min and max. The .N calculates the number of rows:
Here, we chained two operations on a data.table. First, we calculated the number of rows per time series. This results in a new data.table. We then selected a single column named N from that table by using [[...]] syntax.
Use summary function to calculate the 5-point statistics.
Milestone 2
Dataset with interpolated NA's from Milestone 2:
require(R.utils)
require(data.table)
dtAll = fread("../data/m2_allGF_wExpDescr_noNAs.csv.gz")
head(dtAll) Stim_Treat Metadata_T Intensity_MeanIntensity_Ratio Stim_Conc
1: EGF 0 1183.396 0.25 ng/ml
2: EGF 1 1187.431 0.25 ng/ml
3: EGF 2 1172.491 0.25 ng/ml
4: EGF 3 1176.407 0.25 ng/ml
5: EGF 4 1172.421 0.25 ng/ml
6: EGF 5 1167.917 0.25 ng/ml
TrackObjects_Label_Uni
1: EGF_12_0002
2: EGF_12_0002
3: EGF_12_0002
4: EGF_12_0002
5: EGF_12_0002
6: EGF_12_0002
Plot single-cell data
require(ggplot2)
p0 = ggplot(dtAll, aes_string(x = lCol$frame, y = lCol$meas, group = lCol$trackiduni)) +
geom_path(alpha = 0.1) # parameter alpha adds transparency
p0Exercise 4A: remove point outliers
Remove single time points above the threshold 2000, and impute them with interpolated data. Replace outlier measurements with NA's and interpolate them with imputeTS::na_interpolation.
# replace time points with measurements above the threshold with NA's
dtAll[get(lCol$meas) > 2000, `:=`(c(lCol$meas), NA)]
# interpolate NA's
require(imputeTS)
dtAll[, `:=`((lCol$meas), imputeTS::na.interpolation(get(lCol$meas))), by = c(lCol$trackiduni)]Note, that interpolation has to be done independently by track. Doing that on a an entire column without grouping could result in interpolating the last NA of a track with a previous time point of that track and the first time point of the next track.
Exercise 4B: remove dynamics outliers
Remove entire trajectories if at least a single time point is below 1000. Create a vector with unique track ID’s for time series that have measurements below the threshold. Subset dtAll using that vector and %in% operator, or subsetting a keyed dtAll as above.
Exercise 4B: remove dynamics outliers
# vector with unique track id's of time series where measurements were below a
# threshold of 1000
vsTrackOut = unique(dtAll[get(lCol$meas) < 1000][[lCol$trackiduni]])
# Key the table according to a unique track ID column; allows for quick and easy
# subsetting
setkeyv(dtAll, lCol$trackiduni)
# leave only tracks that are NOT in the outlier vector
dtAll = dtAll[!vsTrackOut]
# clean
rm(vsTrackOut, vsTracksWithNAs, p0)Exercise 4C: plot cleaned dataset
ggplot(dtAll, aes_string(x = lCol$frame, y = lCol$meas, group = lCol$trackiduni)) +
geom_path(alpha = 0.1) # parameter alpha adds transparencyInteractive removal of outliers
There’s no single recipe for handling outliers; it all depends on the analysis.
A handy interactive tool written by Mauro Gwerder (a Bachelor student in Olivier Pertz lab) can help with that process. The R/Shiny app can be downloaded from here and executed from RStudio.
Outlier App GUI
Milestone 3
Dataset without outliers from Milestone 3:
Plot per condition
# same as above; repeated for convenience
p0 = ggplot(dtAll, aes_string(x = lCol$frame,
y = lCol$meas,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) + # parameter alpha adds transparency
facet_grid(get(lCol$treatGF) ~ get(lCol$treatConc))
p0Exercise 5: normalisation
Add a column to dtAll with a normalised measurement where every time series is divided by the mean of its first 20 time points. Plot normalised data.
A column with the mean of first 20 elements of a group can be added this way:
.SD corresponds to a subset of data as defined by by section. It’s a temporary data.table and can be used as such.
# add a column with the mean of the beasline for every time series
dtAll[,
baseline := mean(.SD[1:20,
get(lCol$meas)]), # add a new column with the mean of first 20 rows of the group
by = c(lCol$trackiduni)] # group by unique trajectory
# add a column with normalized measurement
dtAll[,
(lCol$measNorm) := get(lCol$meas) / baseline]
# remove baseline column
dtAll[, baseline := NULL]
head(dtAll) Stim_Treat Metadata_T Intensity_MeanIntensity_Ratio Stim_Conc
1: EGF 0 1183.396 0.25 ng/ml
2: EGF 1 1187.431 0.25 ng/ml
3: EGF 2 1172.491 0.25 ng/ml
4: EGF 3 1176.407 0.25 ng/ml
5: EGF 4 1172.421 0.25 ng/ml
6: EGF 5 1167.917 0.25 ng/ml
TrackObjects_Label_Uni Intensity_MeanIntensity_Ratio_Norm
1: EGF_12_0002 1.0065661
2: EGF_12_0002 1.0099982
3: EGF_12_0002 0.9972902
4: EGF_12_0002 1.0006213
5: EGF_12_0002 0.9972307
6: EGF_12_0002 0.9933996
Milestone 4
Dataset with a column with normalised measurement.
Plot normalised data
Plot per condition using normalized data:
# same as above; repeated for convenience
p0 = ggplot(dtAll, aes_string(x = lCol$frame,
y = lCol$measNorm,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) + # parameter alpha adds transparency
facet_grid(reformulate(lCol$treatGF, lCol$treatConc))
p0Add mean to the plot
Note, that we are adding a layer with the summary to the existing ggplot!
p1 = p0 + stat_summary(aes_string(y = lCol$measNorm, group = 1), fun = mean, geom = "line",
group = 1, colour = "red", linetype = "solid", size = 1)
p1Beautifying plots
Add themes, e.g. + theme_bw(), or themes available in packages such as ggthemes.
Use + labels() to add the title, subtitle, the caption.
Use + xlab() or + ylab() to control labels of x and y axes.
require(ggthemes)
p1 + theme_minimal() + labs(title = "ERK activity in response to sustained GF treatments",
caption = paste0("Created on ", Sys.Date())) + xlab("Time (min)") + ylab("ERK activity")